Data Visualization in Python

Matplotlib, Seaborn and Plotly

(Notebook original from: https://www.kaggle.com/vanshjatana/a-simple-tutorial-to-data-visualization)

Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.

In the world of Big Data, data visualization tools and technologies are essential to analyze massive amounts of information and make data-driven decisions.


Import libraries

NumPy (https://numpy.org): The fundamental package for scientific computing with Python.

Pandas (https://pandas.pydata.org): Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

Matplotlib (https://matplotlib.org): Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.

Seaborn (https://seaborn.pydata.org): Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

Plotly (https://plotly.com/python): Plotly's Python graphing library makes interactive, publication-quality graphs. Its documentation includes examples of how to make line plots, scatter plots, area charts, bar charts, error bars, box plots, histograms, heatmaps, subplots, multiple-axes, polar charts, and bubble charts.

Data

Iris Species (https://www.kaggle.com/uciml/iris): Classify iris plants into three species in this classic dataset.

Titanic: Machine Learning from Disaster (https://www.kaggle.com/c/titanic): Start here! Predict survival on the Titanic and get familiar with ML basics.

Campus Recruitment (https://www.kaggle.com/benroshan/factors-affecting-campus-placement): Academic and Employability Factors influencing placement.

House Prices: Advanced Regression Techniques (https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data): Predict sales prices and practice feature engineering, RFs, and gradient boosting.

Predict Future Sales (https://www.kaggle.com/c/competitive-data-science-predict-future-sales): Final project for "How to win a data science competition" Coursera course.

Tesla stock data from 2010 to 2020 (https://www.kaggle.com/timoboz/tesla-stock-data-from-2010-to-2020): How did TSLA do since its inception?

Novel Corona Virus 2019 Dataset (https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset): Day level information on covid-19 affected cases.

Global Hospital Beds Capacity (for covid-19) (https://www.kaggle.com/ikiulian/global-hospital-beds-capacity-for-covid19): Baseline for understanding the typical hospital bed capacity globally.

Natural Language Processing with Disaster Tweets (https://www.kaggle.com/c/nlp-getting-started): Predict which Tweets are about real disasters and which ones are not.

Bar Plot

A bar plot (or bar chart) is one of the most common types of plot. It shows the relationship between a numerical variable and a categorical variable. For example, you can display the heights of several individuals using a bar chart. Bar charts are often confused with histograms, which are quite different: a histogram takes only a numerical variable as input and shows its distribution.

Vertical Bar
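A minimal sketch of a vertical bar chart with Matplotlib. The names and heights are made-up illustration data, not values from any of the datasets listed in this notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt

# Hypothetical data: one numerical value (height) per category (person).
names = ["Ana", "Ben", "Chloe", "Dev"]
heights_cm = [162, 180, 171, 175]

fig, ax = plt.subplots()
bars = ax.bar(names, heights_cm, color="steelblue")
ax.set_xlabel("Individual")
ax.set_ylabel("Height (cm)")
ax.set_title("Height per individual")
```

Each bar's height encodes the numerical variable, and each position on the x-axis is one category.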

Horizontal Bar

Very useful to show rankings.
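As a rough illustration of a ranking, the sketch below sorts some invented team scores and draws them with Matplotlib's `barh`. The team names and scores are assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt

# Hypothetical scores to rank.
scores = {"Team A": 72, "Team B": 95, "Team C": 64, "Team D": 88}

# barh draws the first item at the bottom, so sort ascending
# to place the highest score at the top of the chart.
ranked = sorted(scores.items(), key=lambda item: item[1])
labels = [name for name, _ in ranked]
values = [value for _, value in ranked]

fig, ax = plt.subplots()
ax.barh(labels, values, color="darkorange")
ax.set_xlabel("Score")
ax.set_title("Teams ranked by score")
```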

Stacked Bar

Group Bar

Count Plot

Show the counts of observations in each categorical bin using bars. A count plot can be thought of as a histogram across a categorical, instead of quantitative, variable.
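Seaborn offers `countplot` for this. The sketch below shows the same idea by hand, counting observations with `collections.Counter` and drawing the counts as bars with Matplotlib. The species labels are a tiny invented sample.

```python
from collections import Counter

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt

# Hypothetical categorical observations.
species = ["setosa", "versicolor", "setosa", "virginica", "setosa", "virginica"]

counts = Counter(species)        # number of observations per category
categories = sorted(counts)      # fixed, alphabetical bar order

fig, ax = plt.subplots()
ax.bar(categories, [counts[c] for c in categories])
ax.set_ylabel("Count")
ax.set_title("Observations per species")
```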

Histogram

A Histogram visualises the distribution of data over a continuous interval or certain time period. Each bar in a histogram represents the tabulated frequency at each interval/bin.

Histograms help give an estimate as to where values are concentrated, what the extremes are and whether there are any gaps or unusual values. They are also useful for giving a rough view of the probability distribution.
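A small sketch of a histogram, assuming synthetic normally distributed values rather than any of the notebook's datasets. `ax.hist` returns the per-bin counts and the bin edges, which makes the "tabulated frequency per bin" idea explicit.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic sample: 500 draws from a normal distribution (an assumption for illustration).
rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=500)

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(values, bins=20, edgecolor="white")
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
```

With 20 bins there are 21 bin edges, and the counts add up to the number of observations.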

2d Histogram

Marginal Histogram

Facet Histogram

With Box Margin

With Violin Margin

Density Plot

A density plot is a representation of the distribution of a numeric variable. It uses a kernel density estimate to show the probability density function of the variable. It is a smoothed version of the histogram and is used for the same purpose.
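To make the kernel density estimate concrete, here is a hand-rolled Gaussian KDE in NumPy drawn over a normalised histogram. This is a sketch, not the code seaborn's `kdeplot` runs internally. The sample is synthetic and the bandwidth uses a common normal-reference rule of thumb (1.06 times the standard deviation times n to the power -1/5), which is an assumption of this example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(0.0, 1.0, size=300)  # synthetic data
n = sample.size

# Rule-of-thumb bandwidth (assumed for this sketch).
bandwidth = 1.06 * sample.std(ddof=1) * n ** (-1 / 5)

# Evaluate the estimate on a grid: the average of one Gaussian bump per observation.
grid = np.linspace(sample.min() - 3 * bandwidth, sample.max() + 3 * bandwidth, 400)
z = (grid[:, None] - sample[None, :]) / bandwidth
density = np.exp(-0.5 * z**2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

fig, ax = plt.subplots()
ax.hist(sample, bins=25, density=True, alpha=0.4, label="Histogram")
ax.plot(grid, density, label="Kernel density estimate")
ax.legend()
```

Because the curve estimates a probability density, the area under it is approximately 1.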

Pie Plot

A pie chart (or a circle chart) is a circular statistical graphic, which is divided into slices to illustrate numerical proportion. In a pie chart, the arc length of each slice (and consequently its central angle and area), is proportional to the quantity it represents. While it is named for its resemblance to a pie which has been sliced, there are variations on the way it can be presented.
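The proportionality between quantity and central angle can be checked directly. The sketch below computes the angles by hand and then draws the chart with Matplotlib. The budget categories and amounts are invented for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt

# Hypothetical monthly budget.
labels = ["Rent", "Food", "Transport", "Other"]
amounts = [50, 25, 15, 10]

# Each slice's central angle is its share of the total times 360 degrees.
total = sum(amounts)
angles_deg = [360 * amount / total for amount in amounts]

fig, ax = plt.subplots()
ax.pie(amounts, labels=labels, autopct="%1.1f%%", startangle=90)
ax.set_aspect("equal")  # keep the pie circular
```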

Donut Plot

A donut chart is essentially a Pie Chart with an area of the centre cut out. Pie Charts are sometimes criticised for focusing readers on the proportional areas of the slices to one another and to the chart as a whole. This makes it tricky to see the differences between slices, especially when you try to compare multiple Pie Charts together. A Donut Chart somewhat remedies this problem by de-emphasizing the use of the area. Instead, readers focus more on reading the length of the arcs, rather than comparing the proportions between slices. Also, Donut Charts are more space-efficient than Pie Charts because the blank space inside a Donut Chart can be used to display information inside it.
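One way to draw a donut with Matplotlib is to give the pie wedges a `width` smaller than the radius and then write a summary in the freed centre. The categories and values below are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt

# Hypothetical category sizes.
labels = ["Desktop", "Mobile", "Tablet", "Other"]
sizes = [45, 40, 10, 5]

fig, ax = plt.subplots()
# width=0.4 keeps only the outer 40% of each wedge, leaving a hole in the middle.
ax.pie(sizes, labels=labels, startangle=90,
       wedgeprops=dict(width=0.4, edgecolor="white"))
# Use the blank centre to display extra information.
ax.text(0, 0, f"Total\n{sum(sizes)}", ha="center", va="center")
ax.set_aspect("equal")
```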

Tree Plot

Treemaps are an alternative way of visualising the hierarchical structure of a Tree Diagram while also displaying quantities for each category via area size. Each category is assigned a rectangle area with their subcategory rectangles nested inside of it.

When a quantity is assigned to a category, its area size is displayed in proportion to that quantity and to the other quantities within the same parent category in a part-to-whole relationship. Also, the area size of the parent category is the total of its subcategories. If no quantity is assigned to a subcategory, then its area is divided equally amongst the other subcategories within its parent category.
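The part-to-whole rule can be sketched with a very simple "slice and dice" layout: split a unit square into vertical strips per parent, then split each strip by its children. This is an illustrative layout only, not necessarily the algorithm any plotting library uses, and the hierarchy and quantities are invented.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

# Hypothetical two-level hierarchy: parent -> {child: quantity}.
hierarchy = {
    "Fruit": {"Apple": 30, "Pear": 10},
    "Vegetable": {"Carrot": 40, "Leek": 20},
}
grand_total = sum(sum(children.values()) for children in hierarchy.values())

rects = []  # (label, x, y, width, height) inside a unit square
x = 0.0
for parent, children in hierarchy.items():
    parent_total = sum(children.values())
    width = parent_total / grand_total      # parent area = sum of its children
    y = 0.0
    for child, quantity in children.items():
        height = quantity / parent_total    # child's share within its parent
        rects.append((f"{parent}/{child}", x, y, width, height))
        y += height
    x += width

areas = {label: w * h for label, _, _, w, h in rects}

fig, ax = plt.subplots()
for label, rx, ry, rw, rh in rects:
    ax.add_patch(Rectangle((rx, ry), rw, rh, edgecolor="white",
                           facecolor="tab:green", alpha=0.6))
    ax.text(rx + rw / 2, ry + rh / 2, label, ha="center", va="center")
ax.set_axis_off()
```

Each rectangle's area equals its quantity divided by the grand total, so the four areas sum to 1.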

By contrast, a sunburst plot visualizes hierarchical data spanning outwards radially from root to leaves: the root starts at the centre and children are added to the outer rings.

Sunburst Plot

This type of visualisation shows hierarchy through a series of rings, that are sliced for each category node. Each ring corresponds to a level in the hierarchy, with the central circle representing the root node and the hierarchy moving outwards from it.

Rings are sliced up and divided based on their hierarchical relationship to the parent slice. The angle of each slice is either divided equally under its parent node or can be made proportional to a value.

Colour can be used to highlight hierarchical groupings or specific categories.

Scatter Plot

Also known as a Scatter Graph, Point Graph, X-Y Plot, Scatter Chart or Scattergram.

Scatterplots use a collection of points placed using Cartesian Coordinates to display values from two variables. By displaying a variable in each axis, you can detect if a relationship or correlation between the two variables exists.
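A short sketch of a scatter plot on synthetic data, with the Pearson correlation coefficient computed alongside so the visual impression can be compared with a number. The data-generating line and noise level are assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, positively related variables.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + rng.normal(0, 2.0, size=100)

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation between x and y

fig, ax = plt.subplots()
ax.scatter(x, y, s=15, alpha=0.7)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title(f"Scatter plot (r = {r:.2f})")
```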

Trendline

Multiple Lines

LM Plot

Plot data and regression model fits.
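Seaborn's `lmplot` draws the scatter and the fitted line in one call. As a stand-in that shows what is being fitted, the sketch below does an ordinary least-squares line fit with `numpy.polyfit` on synthetic data and overlays it with Matplotlib. It does not reproduce seaborn's confidence band.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data around the line y = 1.5 x + 4 (an assumption for illustration).
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=80)
y = 1.5 * x + 4.0 + rng.normal(0, 1.0, size=80)

# Degree-1 polynomial fit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)

x_line = np.linspace(x.min(), x.max(), 100)
fig, ax = plt.subplots()
ax.scatter(x, y, s=15, alpha=0.7, label="Observations")
ax.plot(x_line, slope * x_line + intercept, color="crimson", label="Linear fit")
ax.legend()
```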

Resid Plot

Plot the residuals of a linear regression.

This function will regress y on x (possibly as a robust or polynomial regression) and then draw a scatterplot of the residuals. You can optionally fit a lowess smoother to the residual plot, which can help in determining if there is structure to the residuals.
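The description above refers to seaborn's `residplot`. The sketch below shows the underlying idea with NumPy and Matplotlib instead: fit a straight line, subtract it, and plot what is left. The data are deliberately generated from a curve, so the residuals show the kind of structure the plot is meant to reveal. No lowess smoother is included.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data with a quadratic trend, fitted (wrongly) with a straight line.
rng = np.random.default_rng(6)
x = rng.uniform(0, 10, size=200)
y = 0.5 * x**2 + rng.normal(0, 1.0, size=200)

slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)  # observed minus fitted

fig, ax = plt.subplots()
ax.scatter(x, residuals, s=12, alpha=0.7)
ax.axhline(0, color="grey", linewidth=0.8)
ax.set_xlabel("x")
ax.set_ylabel("Residual")
```

The residuals average to zero by construction, but they are positive at the ends and negative in the middle, a U shape that suggests the linear model misses curvature.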

Ternary Plot

A ternary plot, ternary graph, triangle plot, simplex plot, Gibbs triangle or de Finetti diagram is a barycentric plot on three variables which sum to a constant. It graphically depicts the ratios of the three variables as positions in an equilateral triangle. It is used in physical chemistry, petrology, mineralogy, metallurgy, and other physical sciences to show the compositions of systems composed of three species. In population genetics, a triangle plot of genotype frequencies is called a de Finetti diagram. In game theory, it is often called a simplex plot. Ternary plots are tools for analyzing compositional data in the three-dimensional case.
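The barycentric mapping can be written out explicitly. The sketch below converts three parts into x-y positions inside an equilateral triangle with unit side and plots them with Matplotlib. The helper name `ternary_to_xy`, the choice of which corner belongs to which component, and the soil compositions are all assumptions of this example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

def ternary_to_xy(a, b, c):
    """Map three non-negative parts to a point in a unit-side equilateral triangle.

    Corner convention (chosen for this sketch): pure a -> (0, 0),
    pure b -> (1, 0), pure c -> (0.5, sqrt(3)/2).
    """
    a, b, c = (np.asarray(v, dtype=float) for v in (a, b, c))
    total = a + b + c  # normalise so the three parts sum to 1
    x = (b + 0.5 * c) / total
    y = (np.sqrt(3) / 2) * c / total
    return x, y

# Hypothetical compositions (sand, silt, clay) in percent.
sand = [70, 20, 30, 40]
silt = [20, 60, 30, 40]
clay = [10, 20, 40, 20]
xs, ys = ternary_to_xy(sand, silt, clay)

fig, ax = plt.subplots()
ax.plot([0, 1, 0.5, 0], [0, 0, np.sqrt(3) / 2, 0], color="black")  # triangle outline
ax.scatter(xs, ys)
ax.set_aspect("equal")
ax.set_axis_off()
```

An equal mix of the three components lands at the centroid of the triangle.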

Line Chart

Line Graphs are used to display quantitative values over a continuous interval or time period. A Line Graph is most frequently used to show trends and analyse how the data has changed over time.

Line Graphs are drawn by first plotting data points on a Cartesian coordinate grid, then connecting a line between all of these points. Typically, the y-axis has a quantitative value, while the x-axis is a timescale or a sequence of intervals. Negative values can be displayed below the x-axis.

The direction of the lines on the graph works as a nice metaphor for the data: an upward slope indicates where values have increased and a downward slope indicates where values have decreased. The line's journey across the graph can create patterns that reveal trends in a dataset.
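A minimal line chart sketch with Matplotlib, using invented monthly temperatures that include negative values. The month-to-month differences are computed as well, so the "upward slope means increase" reading can be checked against the numbers.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical monthly average temperatures in degrees Celsius.
months = np.arange(1, 13)
temps_c = np.array([-3, -1, 4, 9, 14, 18, 21, 20, 15, 9, 3, -2])

changes = np.diff(temps_c)  # positive = upward slope, negative = downward slope

fig, ax = plt.subplots()
ax.plot(months, temps_c, marker="o")
ax.axhline(0, color="grey", linewidth=0.8)  # negative values fall below this line
ax.set_xlabel("Month")
ax.set_ylabel("Temperature (°C)")
```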

Bubble Plot

A Bubble Chart is a multi-variable graph that is a cross between a Scatterplot and a Proportional Area Chart. Like a Scatterplot, Bubble Charts use a Cartesian coordinate system to plot points along a grid where the X and Y axes are separate variables. However, unlike a Scatterplot, each point is assigned a label or category (either displayed alongside or on a legend). Each plotted point then represents a third variable by the area of its circle. Colours can also be used to distinguish between categories or used to represent an additional data variable. Time can be shown either by having it as a variable on one of the axes or by animating the data variables changing over time.
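In Matplotlib, the `s` argument of `scatter` is the marker area in points squared, so passing values proportional to the third variable makes bubble area (not radius) proportional to it. The sketch below uses invented figures, and the scaling factor of 20 is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: x, y, a third variable for size, and a category code for colour.
gdp_per_capita = np.array([5, 15, 30, 45])
life_expectancy = np.array([60, 70, 76, 81])
population_m = np.array([5, 20, 80, 45])
category = [0, 1, 1, 2]

sizes = 20 * population_m  # marker area scales linearly with the third variable

fig, ax = plt.subplots()
bubbles = ax.scatter(gdp_per_capita, life_expectancy, s=sizes,
                     c=category, cmap="tab10", alpha=0.6)
ax.set_xlabel("GDP per capita (thousand USD)")
ax.set_ylabel("Life expectancy (years)")
```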

Calendar Plot

A calendar plot is a visualization used to show activity over the course of a long span of time, such as months or years. They're best used when you want to illustrate how some quantity varies depending on the day of the week, or how it trends over time.
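One way to build a calendar plot without a dedicated library is to arrange daily values into a grid with one row per weekday and one column per week, then draw the grid as an image. The start date, the eight-week span and the random activity counts below are assumptions made for the sketch.

```python
import datetime as dt

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

start = dt.date(2024, 1, 1)  # a Monday, so each column is a full Mon-Sun week
n_weeks = 8
days = [start + dt.timedelta(days=i) for i in range(7 * n_weeks)]

rng = np.random.default_rng(4)
activity = rng.integers(0, 10, size=len(days))  # hypothetical daily counts

# Rows are weekdays (0 = Monday), columns are weeks since the start date.
grid = np.full((7, n_weeks), np.nan)
for day, value in zip(days, activity):
    week = (day - start).days // 7
    grid[day.weekday(), week] = value

fig, ax = plt.subplots(figsize=(8, 3))
im = ax.imshow(grid, cmap="Greens", aspect="equal")
ax.set_yticks(range(7))
ax.set_yticklabels(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])
ax.set_xlabel("Week")
fig.colorbar(im, ax=ax, label="Activity")
```

Reading along a row shows how one weekday behaves over time, while reading down a column shows a single week.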

Box Plot

A box plot is a way of summarizing a set of data measured on an interval scale. It is often used in exploratory data analysis. This type of graph is used to show the shape of the distribution, its central value, and its variability.
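The sketch below computes the quartiles that a box summarises and then draws box plots for two synthetic groups with Matplotlib. The group names and distributions are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(5)
group_a = rng.normal(50, 5, size=200)   # synthetic data
group_b = rng.normal(60, 10, size=200)

# The box spans the first to third quartile; the line inside is the median.
q1, median, q3 = np.percentile(group_a, [25, 50, 75])
iqr = q3 - q1  # interquartile range, a measure of variability

fig, ax = plt.subplots()
ax.boxplot([group_a, group_b])  # boxes are placed at x = 1, 2 by default
ax.set_xticks([1, 2])
ax.set_xticklabels(["Group A", "Group B"])
ax.set_ylabel("Value")
```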

Violin Plot

A violin plot is a method of plotting numeric data. It is similar to a box plot, with the addition of a rotated kernel density plot on each side, so it also shows the probability density of the data at different values, usually smoothed by a kernel density estimator.
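A brief sketch with Matplotlib's `violinplot`. The second group is deliberately bimodal (two synthetic clusters), which is the kind of shape a violin shows and a box alone would hide.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(9)
group_a = rng.normal(50, 5, size=200)  # single peak
group_b = np.concatenate([rng.normal(40, 3, size=100),
                          rng.normal(60, 3, size=100)])  # two peaks

fig, ax = plt.subplots()
parts = ax.violinplot([group_a, group_b], showmedians=True)
ax.set_xticks([1, 2])
ax.set_xticklabels(["Unimodal", "Bimodal"])
ax.set_ylabel("Value")
```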

Swarm Plot

Draw a categorical scatterplot with non-overlapping points.

A swarm plot can be drawn on its own, but it is also a good complement to a box or violin plot in cases where you want to show all observations along with some representation of the underlying distribution.

With Box

With violin

Strip Plot

Draw a scatterplot where one variable is categorical.

A strip plot can be drawn on its own, but it is also a good complement to a box or violin plot in cases where you want to show all observations along with some representation of the underlying distribution.
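Seaborn provides `stripplot` for this. The sketch below imitates it in plain Matplotlib by adding a small random horizontal jitter to each point and overlaying a box plot as the summary. The jitter width of 0.1 and the two synthetic groups are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(7)
groups = {"A": rng.normal(5, 1.0, size=40), "B": rng.normal(7, 1.5, size=40)}

fig, ax = plt.subplots()
positions = {}
for i, (name, values) in enumerate(groups.items(), start=1):
    # Spread points sideways so overlapping observations stay visible.
    jitter = rng.uniform(-0.1, 0.1, size=values.size)
    positions[name] = i + jitter
    ax.scatter(positions[name], values, s=12, alpha=0.7)

# Boxes sit at x = 1, 2 by default, matching the category positions above.
ax.boxplot(list(groups.values()), showfliers=False)
ax.set_xticks([1, 2])
ax.set_xticklabels(list(groups))
```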

With Box

With violin

Joint Plot

By default, Joint Plot uses Scatter Plot and Histogram. Joint Plot can also display data using Kernel Density Estimate (KDE) and Hexagons. We can also draw a Regression Line in Scatter Plot.

Funnel Plot

Funnel charts are often used to represent data in different stages of a business process. It’s an important mechanism in Business Intelligence to identify potential problem areas of a process. For example, it’s used to observe the revenue or loss in a sales process for each stage, and displays values that are decreasing progressively. Each stage is illustrated as a percentage of the total of all values.

Correlation Plot

A correlation plot computes the correlation between variables and is used to investigate the dependence between multiple variables at the same time. It shows the relation of each variable to the others.
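As a small illustration, the sketch below computes a Pearson correlation matrix with `numpy.corrcoef` and draws it as a coloured grid with Matplotlib. The three synthetic variables are constructed so that `a` and `b` are strongly related while `c` is independent.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(8)
a = rng.normal(size=200)
b = 0.8 * a + 0.2 * rng.normal(size=200)  # strongly tied to a
c = rng.normal(size=200)                  # unrelated to a and b
names = ["a", "b", "c"]

data = np.column_stack([a, b, c])         # one column per variable
corr = np.corrcoef(data, rowvar=False)    # 3 x 3 correlation matrix

fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(3))
ax.set_xticklabels(names)
ax.set_yticks(range(3))
ax.set_yticklabels(names)
fig.colorbar(im, ax=ax, label="Pearson r")
```

The matrix is symmetric with ones on the diagonal, since every variable correlates perfectly with itself.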

Cluster Map

Plot a matrix dataset as a hierarchically-clustered heatmap.

Pair Plot

A pairs plot is also known as a scatterplot matrix. In a single scatterplot, one variable's value is matched with another variable's value from the same data row. A pairs plot elaborates on this by showing every variable paired with every other variable. The popular stats package R will do these effortlessly, and in Python seaborn's pairplot produces the same kind of grid.

Area Plot

Stacked Area Graphs work in the same way as simple Area Graphs do, except for the use of multiple data series that start each point from the point left by the previous data series.

The entire graph represents the total of all the data plotted. Stacked Area Graphs also use the areas to convey whole numbers, so they do not work for negative values. Overall, they are useful for comparing multiple variables changing over an interval.
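A minimal stacked area sketch with Matplotlib's `stackplot`. The product names and yearly figures are invented. The totals are computed separately to show that the top edge of the stack is the sum of all series.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

years = np.arange(2015, 2021)
# Hypothetical non-negative series; each one is stacked on top of the previous.
series = {
    "Product A": np.array([3, 4, 5, 6, 6, 7]),
    "Product B": np.array([2, 2, 3, 3, 4, 4]),
    "Product C": np.array([1, 2, 2, 3, 3, 5]),
}
totals = np.sum(list(series.values()), axis=0)  # height of the whole stack per year

fig, ax = plt.subplots()
polys = ax.stackplot(years, *series.values(), labels=list(series.keys()))
ax.legend(loc="upper left")
ax.set_xlabel("Year")
ax.set_ylabel("Units sold")
```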

Venn Plot

A Venn Diagram is a diagram that visually displays all the possible logical relationships between a collection of sets. Each set is typically represented with a circle.

Contained within each set is a collection of objects or entities that all have something in common. When sets overlap, it’s known as the intersection area. This is where entities that have all the qualities of the overlapping sets are found.

Facet Grid

Multi-plot grid.

Columns

Rows

HUE

Rose Plot

Also known as a Coxcomb Chart or Polar Area Diagram.

This chart was famously used by statistician and medical reformer, Florence Nightingale to communicate the avoidable deaths of soldiers during the Crimean war.

Nightingale Rose Charts are drawn on a polar coordinate grid. Each category or interval in the data is divided into equal segments on this radial chart. How far each segment extends from the centre of the polar axis depends on the value it represents. So each ring from the centre of the polar grid can be used as a scale to plot the segment size and represent a higher value. Therefore, it’s important to notice with Nightingale Rose Charts that it’s the area, rather than the radius of a segment that represents its value.

The major flaw with Nightingale Rose Charts is that the outer segments are given more emphasis because of their larger area size. This disproportionately represents increases in value.
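Because a segment's area grows with the square of its radius, one way to keep area proportional to value is to set each radius to the square root of the value. The sketch below does that with polar bars in Matplotlib. The monthly values are invented and the equal angular width per segment follows the description above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical values for six equal-angle segments.
labels = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
values = np.array([4, 9, 16, 25, 9, 1])

n = len(values)
width = 2 * np.pi / n          # every segment gets the same angle
theta = np.arange(n) * width   # starting angle of each segment

# Sector area = 0.5 * r**2 * width, so r = sqrt(value) makes area proportional to value.
radii = np.sqrt(values)
areas = 0.5 * radii**2 * width

fig = plt.figure()
ax = fig.add_subplot(projection="polar")
ax.bar(theta, radii, width=width, align="edge", edgecolor="white")
ax.set_xticks(theta + width / 2)
ax.set_xticklabels(labels)
```

With this scaling, a value four times larger gets four times the area but only twice the radius.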

Radar Plot

Radar Charts are a way of comparing multiple quantitative variables. This makes them useful for seeing which variables have similar values or if there are any outliers amongst each variable. Radar Charts are also useful for seeing which variables are scoring high or low within a dataset, making them ideal for displaying performance.
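A radar chart can be sketched in Matplotlib as a closed polygon on polar axes: one angle per variable, with the first point repeated at the end to close the shape. The variable names and scores below are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical performance scores on five variables.
labels = ["Speed", "Strength", "Stamina", "Skill", "Strategy"]
scores = [4, 3, 5, 2, 4]

# One evenly spaced angle per variable.
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)

# Repeat the first point so the outline closes.
angles_closed = np.append(angles, angles[0])
scores_closed = scores + scores[:1]

fig = plt.figure()
ax = fig.add_subplot(projection="polar")
ax.plot(angles_closed, scores_closed)
ax.fill(angles_closed, scores_closed, alpha=0.25)
ax.set_xticks(angles)
ax.set_xticklabels(labels)
```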

Heat Map

A heat map is a data visualization technique that shows magnitude of a phenomenon as color in two dimensions. The variation in color may be by hue or intensity, giving obvious visual cues to the reader about how the phenomenon is clustered or varies over space.

Choropleth Map

A choropleth map is a type of thematic map in which areas are shaded or patterned in proportion to a statistical variable that represents an aggregate summary of a geographic characteristic within each area, such as population density or per-capita income.

Orthographic projection

Natural earth projection

Mercator projection

Density Map

Draws a bivariate kernel density estimation with a Gaussian kernel from lon and lat coordinates and optional z values using a colorscale.

Table Plot

Tables in Python with Plotly.

go.Table provides a Table object for detailed data viewing. The data are arranged in a grid of rows and columns. Most styling can be specified for header, columns, rows or individual cells. Table uses column-major order, i.e. the grid is represented as a vector of column vectors.
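Column-major order means that row-oriented records have to be transposed before they are handed to the table. The sketch below shows that step with `zip` and then passes the columns to `go.Table`. The records are invented, and the Plotly import is guarded so that the data-layout part still runs where Plotly is not installed.

```python
try:
    import plotly.graph_objects as go
except ImportError:  # Plotly is optional for this sketch; the data layout is the point
    go = None

header = ["Name", "Species", "Petal length (cm)"]

# Hypothetical records, written row by row as people usually think of them.
rows = [
    ["sample 1", "setosa", 1.4],
    ["sample 2", "versicolor", 4.7],
    ["sample 3", "virginica", 6.0],
]

# Transpose to column-major order: one inner list per column.
columns = [list(column) for column in zip(*rows)]

fig = None
if go is not None:
    fig = go.Figure(data=[go.Table(header=dict(values=header),
                                   cells=dict(values=columns))])
```

In a notebook, calling `fig.show()` would display the table when Plotly is available.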

Word Cloud

A visualisation method that displays how frequently words appear in a given body of text, by making the size of each word proportional to its frequency. All the words are then arranged in a cluster or cloud of words. Alternatively, the words can also be arranged in any format: horizontal lines, columns or within a shape.

Animation Plots

Bar Plot

Scatter Plot

Choropleth

Bubble Plot